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ABSTRACT 

In this paper we study Probability Measures (PM) from a functional point of 
view: we show that PMs can be considered as functionals (generalized func¬ 
tions) that belong to some functional space endowed with an inner product. 
This approach allows us to introduce a new family of distances for PMs, based 
on the action of the PM functionals on ‘interesting’ functions of the sample. 
We propose a specific (non parametric) metric for PMs belonging to this class, 
based on the estimation of density level sets. Some real and simulated data 
sets are used to measure the performance of the proposed distance against a 
battery of distances widely used in Statistics and related areas. 



1 Introduction 


Probability metrics, also known as statistical distances, are of fnndamental importance in 
Statistics. In essence, a probability metric it is a measnre that qnantifies how (dis)similar 
are two random quantities, in particular two probability measures (PM). Typical examples 
of the use of probability metrics in Statistics are homogeneity, independence and goodness of 
£t tests. For instance there are some goodness of £t tests based on the use of the distance 
and others that use the Kolmogorov-Smirnoff statistics, which corresponds to the choice of 
the supremum distance between two PMs. There exist a large literature about probability 
metrics, for a summary review on interesting probability metrics and theoretical results refer 
to (Deza and Deza, 2009; Muller, 1997 Zolotare^ 1983) and references therein. 

Statistical distances are also extensively used in several applications related to Machine 
Learning and Pattern Recognition. Several examples can be found, for instance, in Clustering 


Nielsen and Boltz 

2011; 

Banerjee et al. 

2005 

), Image Analysis ( 

Levina and Bickel 

2001 


Rubner et ah, 2000), Bioinfomatics (Minas et ah, 2013 Saez et ah, 2013), Time Series 


Analysis ( |Ryabko and Mary , 2012, Moon et ah, 1995) or Text Mining (Lebanon, 2006), just 
to name a few. 

In practical situations we do not know explicitly the underlying distribution of the data 
at hand, and we need to compute a distance between probability measures by using a hnite 
data sample. In this context, the computation of a distance between PMs that rely on the 
use of non-parametric density estimations often is computationally difficult and the rate of 


convergence of the estimated distance is usually slow (Nguyen et ah, 2007 Wang et al. 


2005 Stone, 1980). In this work we extend the preliminary idea presented in (Munoz et al. 


2012), that consist in considering PMs as points in a functional space endowed with an inner 


product. We derive then different distances for PMs from the metric structure inherited 
from the ambient inner product. We propose particular instances of such metrics for PMs 
based on the estimation of density level sets regions avoiding in this way the difficult task of 
density estimation. 

This article is organized as follows: In Section we review some distances for PMs and 
represent probability measures as generalized functions; next we dehne general distances 
acting on the Schwartz distribution space that contains the PMs. Section presents a new 
distance built according to this point of view. Section illustrates the theory with some 
simulated and real data sets. Section |5] concludes. 
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2 Distances for probability distributions 


Several well known statistical distances and divergence measnres are special cases of /- 
divergences (Csiszar and Shields, 2004). Consider two PMs, say P and Q, defined on a 


measnrable space (X, X, /i), where X is a sample space, X a a-algebra of measnrable subsets 
of X and /r ; X —?■ IR’*' the Lebesgue measure. For a convex function / and assuming that P 
is absolutely continuous with respect to Q, then the /-divergence from P to Q is dehned by: 


X 


( 2 . 1 ) 


Some well known particular cases: for f(t) = we obtain the Total Variation metric; 
fit) = it — 1)^ yields the x^"distance; fit) = {\/i — 1)^ yields the Hellinger distance. 

The second important family of dissimilarities between probability distributions is made 
up of Bregman Divergences: Consider a continuously-differentiable real-valued and strictly 
convex function ip and dehne: 




= / Mp) - pip) - (p - pWip)) 


( 2 . 2 ) 


X 


where p and q represent the density functions for P and Q respectively and p'{q) is the 


derivative of p evaluated at q (see (Frigyik at al., 2008; Cichocki and Amari, 2010) for further 


details). Some examples of Bregman divergences: pif) = f^ d^iF,Q) yields the Euclidean 
distance between p and q (in L 2 ); pif) = t log(f) yields the Kullback Leibler (KL) Divergence; 
and for pit) = — log(f) we obtain the Itakura-Saito distance. In general df and d^ are not 
metrics because the lack of symmetry and because they do not necessarily satisfy the triangle 
inequality. 


A third interesting family of PM distances are integral probability metrics (IPM) (Zolotarev 
1983 Muller, 1997[ ). Consider a class of real-valued bounded measurable functions on X, 
say TL, and dehne the IPM between P and Q as 


dn{^,Q) = sup / fdF- 1 
f&n J J 

If we choose Ti as the space of bounded functions such that h G "H if 


(2.3) 


< 1, then 


is the Total Variation metric; when TL = {nil l[(-oo,xp] : X = (xi, ...,Xd) G dn is 

the metric computes the maximum 


> 


: u G 


the Kolmogorov distance; if X = 
difference between characteristics functions. In (Sriperumbudur et ah, 2010) the authors 


propose to choose X as a Reproducing Kernel Hilbert Space and study conditions on TL to 
obtain proper metrics dy,. 
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In practice, the obvious problem to implement the above described distance functions is 
that we do not know the density (or distribution) functions corresponding to the samples 
under consideration. For instance suppose we want to estimate the KL divergence (a par¬ 


ticular case of Eq. (2.1) taking f{t) = — logt) between two continuous distributions P and 
Q from two given samples. In order to do this we must choose a number of regions, N, 
and then estimate the density functions for P and Q in the N regions to yield the following 
estimation: 


N 

KL(p,Q) = y^paog?, 

Qi 


i=l 


( 2 . 4 ) 


see further details in (Boltz et ah, 2009) 


As it is well known, the estimation of general distribution functions becomes intractable 
as dimension arises. This motivates the need of metrics for probability distributions that do 
not explicitly rely on the estimation of the corresponding probability/distribution functions. 
For further details on the sample versions of the above described distance functions and 


their computational subtleties see ( 

Scott 

1992 

Cha 

2007 

Wang et ah 

2005 

Nguyen et al. 

2007 

Sriperumbudur et ah 

2010 

Goria et ah 

2005 

Szekely and Rizzo 

2004) 

and references 


therein. 

To avoid the problem of explicit density function calculations we will adopt the perspec¬ 


tive of the generalized function theory of Schwartz (see (Zemanian, 1965), for instance), 
where a function is not specihed by its values but by its behavior as a functional on some 
space of testing functions. 


2.1 Probability measures as Schwartz distributions 

Consider a measure space (X, /r), where X is a sample space, here a compact set0 of a 

real vector space: X C M'^, X a cr-algebra of measurable subsets of X and /i : X —?■ IR'*' 
the ambient cr-additive measure (here the Lebesgue measure). A probability measure P is a 
cr-additive hnite measure absolutely continuous w.r.t. /i that satishes the three Kolmogorov 
axioms. By Radon-Nikodym theorem, there exists a measurable function / : X —IR^ (the 
density function) such that P(A) = and / = ^ is the Radon-Nikodym derivative. 


A PM can be regarded as a Schwartz distribution (a generalized function, see (Strichartz 


1994) for an introduction to Distribution Theory): We consider a vector space V of test 
functions. The usual choice for X is the subset of C°°{X) made up of functions with compact 
support. A distribution (also named generalized function) is a continuous linear functional 
on X. A probability measure can be regarded as a Schwartz distribution P : X —>■ IR by 


not restrictive assumption in real scenarios, see for instance (Moguerza and Munoz 20061. 
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defining P(0) = (P, 0) = / (j)dF = J (j){x)f{x)dj^{x) = When the density function 

f eV, then / acts as the representer in the Riesz representation theorem: P(-) = (•, /). 

In particular, the familiar condition P(X) = 1 is equivalent to (P, l[x]) = 1, where the 
function l[x] belongs to V, being X compact. Note that we do not need to impose that 
f ET>; only the integral (0, /) should be properly dehned for every (f) eV. 

Hence a probability measure/distribution is a continuous linear functional acting on a 
given function space. Two given linear functionals Pi and P 2 will be identical (similar) 
if they act identically (similarly) on every (p E V. For instance, if we choose p = Id, 
Pl(0) = = f xdP = /ip^ and if Pi and P 2 are ‘similar’ then /ip^ ~ /ipj because Pi and 

P 2 are continuous functionals. Similar arguments apply for variance (take 4>{x) = {x — p)^) 
and in general for higher order moments. For (j)^{x) = G IR, we obtain the Fourier 

transform of the probability measure (called characteristic functions in Statistics), given by 

P{^) = (P, = J e'^^^dF. 

Thus, two PMs can be identified with their action as functionals on the test functions 
if the set of test functions V is rich enough and hence, distances between two distributions 
can be dehned from the differences between functional evaluations for appropriately chosen 
test functions. 

Definition 1. (Identification of PM’s). Let T> be a set of test funetions and P and Q 

two PM’s defined on the measure spaee (X, p), then we say that F = Q on V if: 

(P,0) = (Q,0) WfiEV. 

The key point in our approach is that if we appropriately choose a hnite subset of test 
functions {fii}, we can compute the distance between the probability measures by calculating 
a hnite number of functional evaluations. In the next section we demonstrate that when V 
is composed by indicator functions that indicates the regions where the density remains 
constant, then the set V is rich enough to identify PM. In the next section we dehne a 
distance based on the use of this set of indicator functions. 


3 A metric based on the estimation of level sets 

We choose V as CfiX), the space of all compactly supported, piecewise continuous functions 
on X (compact), as test functions (remember that CfiX) is dense in Lp). Given two PMs P 
and Q, we consider a family of test functions {4>i}i^i C V and then dehne distances between 
P and Q by weighting terms of the type d ((P, pi) , (Q, pi)) for i E I, where d is some distance 
function. Our test functions will be indicator functions of a-level sets, described below. 
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Given a PM P with density function /p, minimum volume sets are defined by S'q,(/p) = 

{x G X\ f^{x) > a}, such that P{Sa{fp)) = 1 — where 0 < z/ < 1. If we consider 

an ordered sequence 0 < ai < ... < am, then C SaXM- Let us dehne the 

ttj-level set: Aj(P) = S'q,. (/p) — 5'q,^_,_i(/p), i G {1,... ,m — 1}. We can choose ai ~ 0 and 

am > maxa-gx/p(a^) (which exists, given that X is compact and /p piecewise continuous); 
then IJj A(1P) — Supp(P) = {x G X| fp{x) ^ 0} (equality takes place when m —)■ cx), ai —)■ 0 
and am —t max^^x /p(a^))- Given the dehnition of the Ai, if Aj(P) = Ai(Q) for every i when 
m —)■ oo, then P = Q. We formally prove this proposition with the aid of the following 
theorem. 


Definition 2. (ol^ sequence). Given a PM ¥ defined on the measure spaee {X,P,p,), 
with density funetion /p and m E N, define od^ = {ai, ..., am} where 0 = oi < ... < am = 
max^ff>{x). 

Theorem 1. (a-level set representation of a PM). Given a PM P defined on the 
measure spaee {X,P,fi), with density funetion /p and a sequenee cc™, eonsider the set of 
indicator funetions ^j^p = l[Ai(P)] : X —)■ {0,1} of the a-level sets Ai(P) = Safifp) — >S'q,.^^(/p) 
for i e {1,... ,m - 1}. Define fm{x) = p(a;). Then: 


lim fm{x) = fp{x), 

m^oo 

where the convergence is pointwise almost everywhere. Moreover, as the sequence fm is 
monotonically increasing (fm-i < fm), by Dini’s Theorem, the convergence is also uniform 
(converge uniformly almost everywhere). 

Corollary 1. (a-level sets identification of PMs). If the set of test functions T> 
contains the indicator functions of the a-level sets, then V is rich enough to discriminate 
among PMs. 


Now we elaborate on the construction of a metric that is able to identify PM. Denote 
by to the set of probability distributions on X and given a suitable sequence of non¬ 
decreasing values {ai}fLi, define: I0x ^ D '■ 0i(lP) = l[Ai(P)]- We propose distances of the 
form (‘(’4 ^))'!^z(Q))- Gonsider, as an example, the measure of the standardized 

symmetric difference: 


d(0,(P),0,(Q)) 


ia{A{¥)AA{Q)) 

/i(kl,(P)Ukl,(Q))‘ 


This motivates the dehnition of the a-level set semi-metric as follows. 


Definition 3. (Weighted a-level set semi-metric). Given m E N, consider two se¬ 
quences: odf and (3 q, for P and Q respectively. Then define a family of weighted a-level set 
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distances between P and Q by 


m—1 


da,is{^d 


^w;id(0i(P),0i(Q)) 

i=l 

i=l 


(3.1) 


li(Ai(,P)uA, 


where Wi..., Wm-i € P"*" and fi is the ambient measure. 


Equation (3.1) can be interpreted as a weighted sum of Jaccard distances between the 
y4j(P) and A(Q) sets. For m 0, when P cs Q, then da,i 3 {^, Q) 0 since /i (A(IP) A Ai 


0 for all i G (assume |/p(a;) — /Q(a;)| < e for all x, since /p then 

/x(A(P) A4 


£—^0 


£—^0 


0 Vi, because otherwise contradicts the fact that /p = 


Proposition 1. (Convergence of the a-level set semi-metric to a metric). da,i 3 (f‘,Q.) 
converges to a metric when m —>■ cx). 


The semi-metric proposed in Eq. (3.1) obeys the following properties: is non-negative, 
that is da,/ 3 (P 5 Q) > 0 and lim da,i 3 (P,Q) = 0 if and only if P = Q. For hxed pairs 
(q:,P) and (/3, Q) it is symmetric (icj^^(P, Q) = (i/ 3 ,a(Q, P)- Therefore constitutes a proper 


metric when m —)■ oo. The semi-metric proposed in Eq. (3.1) is invariant under affine 


transformations (see the Appendix B for a formal proof). In section 3.2 we will propose a 
weighting scheme for setting the weights 


Of course, we can calculate da,i 3 in Eq. (3.1) only when we know the distribution function 
for both PMs P and Q. In practice there will be available two data samples generated from 
P and Q, and we need to dehne some plug in estimator: Consider estimators Aj(P) = 
Saiifw) — (details in subsection 3.1), then we can estimate (ia,/ 3 (P, Q) by 


^ g(Ai{F) AAi{Q) 

da,/3(P, Q) = V' Wi -:- 

i=i /i (Ai(P) U Aj(Q) 


(3.2) 


It is clear that g ^A(P) 0 y4i(Q) j equals the total number of points in Aj(P) U y4j(Q), say 
(( ^Ai(P) U Aj(Q)^. Regarding the numerator in Eq. 
to facilitate the notation, and the corresponding sample estimates A and B, one is tempted 
to estimate g{A A B), the area of region A A R, by /i(A A B) = (({A — R) U i({B — A) = 
(({AVJ B) — (({Ar\ B). However this is incorrect since probably there will be no points in 
common between A and R (which implies A A R = A U R). 

In our particular case, the algorithm in Table 1 shows that Aj(P) is always a subset of 
the sample sp drawn from the density function /p, and we will denote this estimation by 
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(3.2), given two level sets, say A and R 














(a) 


(b) 


(c) 


Figure 1: Set estimate of the symmetric difference, (a) Data samples (red) and sb (blue), 
(b) Sb - Covering A: blue points, (c) sa - Covering 13: red points. Blue points in (b) plus 
red points in (c) are the estimate of A A B. 


s^.(P) from now on. We will reserve the notation Aj(P) for the covering estimation of Aj(P) 
dehned by U^B{xj,rA) where Xj G B{xj,rA) are closed balls with centres at Xj and 

(hxed) radius va (Devroye and Wise, 1980). The radius is chosen to be constant (for data 
points in v4j(P)) because we can assume that density is approximately constant inside region 
Aj(P), if the partition of the set is hne enough. For example, in the experimental 

section, we £x as the median distance between the points that belongs to the set 

To illustrate this notation we include Figure In Figure (a) we show two data different 
samples from a-level sets A and B: sa (red points) and sb (blue points), respectively. In 
Figure Sb) A is the covering estimation of set A made up of the union of balls centered in 


the data red points sa- That is A = U^B{xj,rA) A. Figure 111 (c) can be interpreted 

n^oo I_I 




equivalently regarding the covering of the sample sb- The problem of calculating fi{A A B) 
thus reduces to estimate the number points in 13 not belonging to the covering estimate of 
A, plus the number points in A not belonging to the covering estimate of B. To make the 
computation explicit consider x E A,y E B and dehne 


^rAXB {x,y) ^[B{x,rA)]i.y) T l[B(y,rs)]()r) 

~ '^[B{x,rA)]{y)'^[B{y,rB)]{^)j 

where IrA,rB{^j ?/) = 1 when y belongs to the covering A, x belongs to the covering B or both 
events happen. Thus if we dehne 

xeA yeB 
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Estimation of Rn = SQ,(f): 

1 Choose a constant z/ G [0,1]. 

2 Consider the order induced in the sample by the sparsity measure gn{.x), that is, 

< ■ ■ ■ < gn{x(n)), where xp) denotes the sample, ordered after g. 

3 Consider the value p* = g{x(^^n)) if ^ Pn = gn{x(^[^n]+i)) otherwise, where [x] stands 
for the largest integer not greater than x. 

4 Dehne hn{x) = sign{p*^ - p„(a;)). 


Table 1: Algorithm to estimate minimum volume sets {So,{f)) of a density /. 

we are able to estimate the symmetric difference by 

p(A^B) = piAxTB) - piArTB) = #/i(A UR) - /(A, B). 


3.1 Estimation of level sets 

To estimate the set denoted as we implement a One-Class Neighbor Machine approach 


(Munoz and Moguerza, 2006, 2005). The One-Class Neighbor Machine solves the following 


optimization problem: 


n 



max 

unp - ^ 6 

i=l 



s.t. 

g{.Xi) > p-ii, 

e* > 0, * = !,.. 

. ,n 


(3.3) 


where g{x) = M{x,Sn) is a sparsity measure (see the Appendix C for further details), 
u G [0,1] such that P{Sa) = 1 — with i = 1,... ,n are slack variables and p is a 
predehned constant. With the aid of the Support Neighbor Machine, we estimate a density 
contour cluster S'q. (/) around the mode for a suitable sequence of values (note that 

the sequence 0 > = 1 it is in a one-to-one correspondence with the sequence 

0 < cxi < ... < am = niax 3 ;gx/p(a^))- In Table 1 we present the algorithm to estimate 
Sa{f) of a density function /. Hence, we take s^,(p) = ^a,(/p) - 5„,_^j(/p), where ^a,(/p) is 
estimated by Rn dehned in Table 1 (the same estimation procedure applies for 

The computational complexity of the algorithm of Table 1 and more details on the esti- 


Munoz and Moguerza 

2004, 

2005 

2006) 


execution time required to compute SQ,(f) grows at a rate of order O^dri^), where d represent 
the dimension and n the sample size of the data at hand. 
















3.2 Choice of weights for o-level set distances 


In this section we define a weighting scheme for the family of distances defined by Eq. 
(3.1). Denote by sp and sq the data samples corresponding to PMs P and Q respectively, 
and denote by and the data samples that estimate Aj(P) and Aj(Q), respec¬ 
tively. Remember that we can estimate these sets by coverings Aj(P) = ''"ii(P))’ 


A, 




Let m denote the size of the ck™ and /3q sequences. Denote by the number of 


data points in 


i(P)’ 


the number of data points in 


(Q)’ 'Ai(P) 


(P) 

the (fixed) radius for 


the covering Aj(P) and the (fixed) radius for the covering Aj(Q), usually the mean or 

the median distance inside the region Aj(P) and Aj(Q) respectively. We define the following 
weighting scheme: 


Wi 


1 

m 


’^Ai(P) ^AiCQ) 




-L 


'Ai(P)’'Ai(Q) 



x-y \\2 


^*(^)) (5 V(p') A , 


(3.4) 


The weight Wi is a weighted average of distances between a point of and a point 

of where ||a; — y \\2 is taken into account only when 1^^ ~ details 

about the weighting scheme and its extension can be seen in the Appendix D. 


4 Experimental work 


Since the proposed distance is intrinsically nonparametric, there are no simple parameters 
on which we can concentrate our attention to do exhaustive benchmarking. The strategy will 
be to compare the proposed distance to other classical PM distances for some well known 
(and parametrized) distributions and for real data problems. Here we consider distances 


belonging to the main types of PMs metrics: Kullback-Leibler (KL) divergence (Boltz et al. 


2009 [Nguyen et al., 2007) (/-divergence and also Bregman divergence), t-test (T) 


mea¬ 


sure (Hotelling test in the multivariate case). Maximum Mean Discrepancy (MMD) distance 


(Gretton et all 2012 Sriperumbudur et al.l 2010) and Energy distance (Szekely and Rizzo 


(Gretton et aLf 2012 

2004 

Sejdinovic et al. 

jdinovic et al. 

2012)). 


2004 Sejdinovic et ah, 2012) (an Integral Probability Metric, as it is demonstrated in (Se- 
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Table 2: 5*\/d for a 5% type I and 10% type II errors. 


Metric 

d: 1 

2 

3 

4 

5 

10 

15 

20 

50 

100 

KL 

0.870 

0.636 

0.433 

0.430 

0.402 

0.474 

0.542 

0.536 

0.495 

0.470 

T 

0.490 

0.297 

0.286 

0.256 

0.246 

0.231 

0.201 

0.182 

0.153 

0.110 

Energy 

0.460 

0.287 

0.284 

0.256 

0.250 

0.234 

0.203 

0.183 

0.158 

0.121 

MMD 

0.980 

0.850 

0.650 

0.630 

0.590 

0.500 

0.250 

0.210 

0.170 

0.130 

LS(0) 

0.490 

0.298 

0.289 

0.252 

0.241 

0.237 

0.220 

0.215 

0.179 

0.131 

LS(1) 

0.455 

0.283 

0.268 

0.240 

0.224 

0.221 

0.174 

0.178 

0.134 

0.106 


4.1 Artificial data 

4.1.1 Discrimination between normal distributions 


In this experiment we quantify the ability of the considered PM distances to test the null 
hypothesis Hq : P = Q when P and Q are multivariate normal distributions. To this end, 
we generate a data sample of size lOOd from a normal distribution A^(0,ld) = P, where d 
stands for dimension and then we generate 1000 iid data samples of size lOOd from the same 
A^(0, Irf) distribution. Next we calculate the distances between each of these 1000 iid data 
samples and the hrst data sample to obtain the 95% distance percentile denoted as 

Now dehne 5 = dl = d(l,...,l) and increase 6 by small amounts (starting from 0). 
For each <5 we generate a data sample of size lOOd from a A^(0 + <5,1^) = Q distribution. If 
d(P, Q) > dp®^" we conclude that the present distance is able to discriminate between both 
populations (we reject Ho) and this is the value 6* referenced in Table 4.1.1 To track the 
power of the test, we repeat this process 1000 times and £x 6* to the present 6 value if the 
distance is above the percentile in 90% of the cases. Thus we are calculating the minimal 
value 6* required for each metric in order to discriminate between populations with a 95% 
conhdence level (type I error = 5%) and a 90% sensitivity level (type II error = 10%). In 
we report the minimum distance {6*'/d) between distributions centers required 


Table 


4.1.1 


to discriminate for each metric in several alternative dimensions, where small values implies 
better results. In the particular case of the T-distance for normal distributions we can use 
the Hotelling test to compute a p-value to £x the 6* value. 

The data chosen for this experiment are ideal for the use of the T statistics that, in fact, 
outperforms KL and MMD. However, Energy distance works even better than T distance in 
dimensions 1 to 4. The LS(0) distance work similarly to T and Energy until dimension 10. 
The LS(1) distance outperform to all the competitor metrics in all the considered dimensions. 
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Table 3: (1 + a*) for a 5% type I and 10% type II errors. 


Metric 

dim: 1 

2 

3 

4 

5 

10 

15 

20 

50 

100 

KL 

'T 

3.000 

1.700 

1.250 

1.180 

1.175 

1.075 

1.055 

1.045 

1.030 

1.014 

1 

Energy 

1.900 

1.600 

1.450 

1.320 

1.300 

1.160 

1.150 

1.110 

1.090 

1.030 

MMD 

6.000 

4.500 

3.500 

2.900 

2.400 

1.800 

1.500 

1.320 

1.270 

1.150 

LS(0) 

1.850 

1.450 

1.300 

1.220 

1.180 

1.118 

1.065 

1.040 

1.030 

1.012 

LS(1) 

1.700 

1.350 

1.150 

1.120 

1.080 

1.050 

1.033 

1.025 

1.015 

1.009 


In a second experiment we consider again normal populations but different variance- 
covariance matrices. Define as an expansion factor a G M and increase a by small amounts 
(starting from 0) in order to determine the smallest a* required for each metric in order 
to discriminate between the lOOd sampled data points generated for the two distributions: 
N(0,ld) = P and iV(0, (1 -|- o')Id) = Q. If d(P, Q) > we conclude that the present 

distance is able to discriminate between both populations and this is the value (1 -|- a*) 


reported in Table 4.1.1 To make the process as independent as possible from randomness 
we repeat this process 1000 times and £x a* to the present a value if the distance is above 
the 90% percentile of the cases, as it was done in the previous experiment. 

There are no entries in Table 14.1.11 for the T distance because it was not able to distin¬ 
guish between the considered populations in none of the considered dimensions. The MMD 
distance do not show a good discrimination power in this experiment. We can see here 
again that the proposed LS(1) distance is better than the competitors in all the dimensions 
considered, having the LS(0) and the KL similar performance in the second place among the 
metrics with best discrimination power. 


4.1.2 Homogeneity tests 

This experiment concerns a homogeneity test between two populations: a mixture between 
a Normal and a Uniform distribution (P = aN{fi = 1, a = 1) -|- (1 — a)U{a = 1,6 = 8) 
where a = 0.7) and a Gamma distribution (Q = j^shape = 1, scale = 2)). To test the 
null hypothesis: Hq : F = Q we generate two random i.i.d. samples of size 100 from P and 
Q, respectively. Figure shows the corresponding density functions for P and Q. In the 
cases of KL-divergence, T, Energy MMD and LS distances we proceed as in the previous 
experiment, we run a permutation test based on 1000 random permutation of the original 
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Normal-Uniform mixture 



Figure 2: Mixture of a Normal and a Uniform Distribution and a Gamma distribution. 

data in order to compute the p-value. In the case of Kolmogorov-Smirnov, and Wilcoxon 
test we report the p-value given by these tests. Results are displayed in Table Only the 
LS distances are able to distinguish between both distributions. Notice that hrst and second 
order moments for both distribution are quite similar in this case (/ip = 2.05 ~ /xq = 2 and 
(jp = 4.5 ~ ctq = 4) and additionally both distributions are strongly asymmetric, which also 
contributes to explain the failure of those metrics strongly based on the use of the first order 
moments. 

Table 4: Hypothesis test between a mixture of Normal and Uniform distributions and a 
Gamma distribution._ 


Metric 

Parameters 

p-value 

Reject? 

Kolmogorov- S mir nov 


0.281 

No. 

test 


0.993 

No. 

Wilcoxon test 


0.992 

No. 

KL 

A; = 10 

0.248 

No. 

T 


0.342 

No. 

Energy 


0.259 

No. 

MMD 


0.177 

No. 

LS (0) 

m = 10 

0.092 

Yes. 

LS (1) 

m = 10 

0.050 

Yes. 
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Figure 3: Real image (a) and sampled image (b) of a hart in the MPEG7 CE-Shape-1 
database. 


4.2 Two real case-studies 

4.2.1 Shape classification 

As an application of the preceding theory to the held of pattern recognition problem we 


consider the MPEG7 GE-Shape-1 (Latecki et ah, 2000), a well known shape database. We 


select four different classes of objects/shapes from the database: hearts, coups, hammers 
and bones. For each object class we choose 3 images in the following way: 2 standard images 
plus an extra image that exhibit some distortion or rotation (12 images in total). In order 
to represent each shape we do not follow the usual approach in pattern recognition that 
consists in representing each image by a feature vector catching its relevant shape aspects; 
instead we will look at the image as a cloud of points in according to the following 
procedure: Each image is transformed to a binary image where each pixel assumes the value 
1 (white points region) or 0 (black points region) as in Figure (a). For each image i of 
size Ni X Mi we generate a uniform sample of size NiMi allocated in each position of the 
shape image i. To obtain the cloud of points as in Figure (b) we retain only those points 
which fall into the white region (image body) whose intensity gray level are larger than a 
variable threshold hxed at 0.99 so as to yield around one thousand and two thousand points 
image representation depending on the image as can be seen in Figure]^ (b). After rescaling 
and centering, we compute the 12 x 12 image distance matrices, using the LS(1) distance 
and the KL divergence, and then compute Euclidean coordinates for the images via MDS 
(results in Figure]^. It is apparent that the LS distance produces a MDS map coherent with 
human image perception (Eg. (a)). This does not happen for the rest of tested metrics, in 
particular for the KL divergence as it is shown in Figure]^ (b)). 
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Figure 4: Multi Dimensional Scaling representation for objects based on (a) LS(1) and (b) 
KL divergence. 

4.2.2 Testing statistical significance in Microarray experients 

Here we present an application of the proposed LS distance in the held of Bioinformatics. The 
data set we analyze comes from an experiment in which the time to respiratory recovery in 
ventilated post trauma patients is studied. Affymetrix U133+2 micro-arrays were prepared 
at days 0, 1, 4, 7, 14, 21 and 28. In this analysis, we focus on a subset of 46 patients which 
were originally divided into two groups: “early recovery patients” (group Gi) that recovered 
ventilation prior to day seven and “late recovery patients ” (group G 2 ), those who recovered 
ventilation after day seven. The size of the groups is 22 and 26 respectively. 

It is of clinical interest to hnd differences between the two groups of patients. In par¬ 
ticular, the originally goal of this study was to test the association of inhammation on day 
one and subsequent respiratory recovery. In this experiment we will show how the proposed 
distance can be used in this context to test statistical differences between the groups and 
also to identify the genes with the largest effect in the post trauma recovery. 

From the original data set we select the sample of 675 probe sets corresponding to those 
genes whose GO annotation include the term “inflammatory”. To do so we use a query (July 
2012) on the Affymetrix web site (affymetrix.com). The idea of this search is to obtain a 
pre-selection of the genes involved in post trauma recovery in order to avoid working with 
the whole human genome. 

Figure shows the heat map of day one gene expression for the 46 patients (columns) 
over the 675 probe-sets. By using a hierarchical procedure, it is apparent that the two 

^http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE13488 
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Figure 5: Affymetrix U133+2 micro-arrays data from the post trauma recovery experiment. 
On top, a hierarchical cluster of the patients using the Euclidean distance is included. At 
the bottom of the plot the grouping of the patients is shown: 1 for “early recovery” patients 
and 2 for “late recovery” patients. 


main clusters we hnd do not correspond to the two groups of patients of the experiment. 
However, the hrst cluster (on the left hand of the plot) contains mainly patients form the 
“early recovery” group (approx. 65 %) whereas the second cluster (on the right) is mainly 
made up of patients from the “late recovery” group (approx 60%). This lack of balance 
suggests a different pattern of gene expression between the two groups of patients. 

In order to test if statistical differences exists between the groups Gi and G 2 we dehne, 
inspired by (|Hayden et ah 2009), an statistical test based on the LS distance proposed 
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early recovery 


late recovery 




Figure 6: Gene density profiles (in logarithmic scale) of the two groups of patients in the 
sample. The 50 most significant genes were used to calculate the profiles with a kernel 
density estimator. 


in this work. To this end, we identify each patient i with a probability distribution Pj. 
The expression of the 675 genes across the probe-sets are assumed to be samples of such 
distributions. Ideally, if the genes expression does not have any effect on the recovery speed 
then all distributions Pj should be equal {Hq). On the other hand, assume that expression of 
a gene or a group of genes effectively change between “early” and “late” recovery patients. 
Then, the distributions Pj will be different between patients belonging to groups Gi and G 2 


m. 

To validate or reject the previous hypothesis, consider the proposed LS distance (ia(Pj, Pj 
dehned in (3.2) for two patients i and j. Denote by 


and 


Ai 

A 2 


1 

22(22 - 1 ) 


d„(Pj,Pj), 

*j6Gi 


1 

26(26 - 1) 


Y d«(Pi,Pj) 

i,jeG2 


Ai2 — 


1 

22 ■ 26 


Y 4(Pi,Pj), 


the averaged a-level set distances within and between the groups of patients. Using the 
previous quantities we define a distance between the groups Gi and G 2 as 


A* 


Ai2 — 


Ai -|- Aj 


(4.1) 
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(a) Heat-map of the top-50 ranked genes (rows). 
A hierarchical cluster of the patients is included 
on top. The labels of the patients regarding their 
recovery group are detailed at the bottom of the 
plot. 


(b) P-values obtained by the proposed a-level set 
distance based test for different samples of in¬ 
creasing number of genes. 


Figure 7: Heat-map of the 50-top ranked genes and p-values for different samples. 


Notice that if the distributions are equal between the groups then A* will be close to 
zero. On the other hand, if the distributions are similar within the groups and different 
between them, then A* will be large. To test if A* is large enough to consider it statistically 
signihcant we need the distribution of A* under the null hypothesis. Unfortunately, this 
distribution is unknown and some re-sampling technique must be used. In this work we 
approximate it by calculating a sequence of distances A^^^, ■ ■ ■, ^*n) where each A^^^ is the 
distance between the groups Gi and G 2 under a random permutation of the patients. For a 
total of N permutations, then 


p — value 


N 


(4.2) 


where refers to the number of times the condition 0 is satished, is a one-side p-value 
of the test. 

We apply the previous LS distance based test (weighting scheme 1 with 10000 permuta¬ 
tions) using the values of the 675 probe-sets and we obtain a p-value = 0.1893. This result 
suggests that none differences exists between the groups exist. The test for micro-arrays 
proposed in (Hayden et ah, 2009) also conhrms this result with a p-value of 0.2016. The 
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reason to explain this -a priori- unexpected result is that, if differences between the groups 
exist, they are probably hidden by a main group of genes with similar behavior between 
the groups. To validate this hypothesis, we hrst rank the set of 675 genes in terms of their 
individual variation between groups Gi and G 2 - To do so, we use the p-values of individual 
difference mean T-tests. Then, we consider the top-50 ranked genes and we apply the a-level 
set distance test. The obtained p-value is 0.010, indicating a signihcant difference in gene 
expression of the top-50 ranked genes. In Figure we show the estimated density prohles of 
the patients using a kernel estimator. It is apparent that the prohles between groups are dif¬ 
ferent as it is rehected in the obtained results. In Figure we show the heat-map calculated 
using the selection of 50 genes. Note that a hierarchical cluster using the Euclidean distance, 
which is the most used technique to study the existence of groups in micro-array data, is not 
able to accurately rehect the existence of the two groups even for the most inhuential genes. 

To conclude the analysis we go further from the initial 50-genes analysis. We aim to 
obtain the whole set of genes in the original sample for which differences between groups 
remain signihcant. To do so, we sequentially include in the hrst 50-genes sample the next- 
highest ranked genes and we apply to the augmented data sets the LS distance based test. 
The p-values of such analysis are shown in Figure Their value increases as soon as more 
genes are included in the sample. With a type-I error of 5%, statistical diherences are found 
for the 75 hrst genes. For a 10% type-I error, with the hrst 110 genes we still are able 
to hnd diherences between groups. This result shows that diherences between “early” and 
“late” recovery trauma patients exist and they are caused by the top-110 ranked genes of 
the Ahymetrix U133-I-2 micro-arrays (hltered by the query “inhammatory”). By considering 
each patient as a probability distribution the LS distance has been used to test diherences 
between groups and to identify the most inhuential genes of the sample. This shows the 
ability of the new proposed distance to provide new insights in the analysis of biological 
data. 


5 Conclusions 

In this paper we presented probability measures as generalized functions, acting on appro¬ 
priate function spaces. In this way we were able to introduce a new family of distances for 
probability measures, based on the evaluation of the PM functionals on a hnite number of 
well chosen functions of the sample. The calculation of these PM distances does not require 
the use of either parametric assumptions or explicit probability estimations which makes 
a clear advantage over most well established PM distances, such as Bregman divergences, 
which makes a clear advantage over most well established PM distances and divergences. 
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A battery of artificial and real data experiments have been used to study the performance 
of the new family of distances. Using synthetically generated data, we have shown their 
performance in the task of discriminating normally distributed data. Although the generated 
data sets are ideal for the use of T-statistics the new distances shows superior discrimination 
results than classical methods. Similar conclusions have been obtained when the proposed 
distances are used in homogeneity test scenarios. Regarding the practical applications, the 
new PM distances have been proven to be competitive in shape recognition problems. Also 
they represent a novel way to identify genes and discriminate between groups of patients in 
micro-arrays. 

In the near future we will afford the study of the geometry induced by the proposed mea¬ 
sure and its asymptotic properties. It is also interesting to investigate in the relationship 
there may be exist between the proposed distance and other probability metrics. 
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ECO2012-38442, SEJ2007-64500, MTM2012-36163-C06-06, DGUCM 2008/00058/002 and 
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Appendix 

A) Proof for Theorems of Section 3 

of Theorem 1. Consider x G Supp(P); given m and a sequence cc™, x G Ai(P) = Saffp) — 
Sai+iifw) for one (and only one) i G {l,...,m — 1}, that is ctj < /p(a;) < aj+i. Then 
0i,p(a^) = l[Ai(P)](a^) = 1 in the region Aj(P) and zero elsewhere. Given e > 0, choose m > ^ 
and ttj+i = ai + Given that a* < fp{x) < aj+i, then \ai — fp{x)\ < and thus: 

\fm-l{x) - fp{x)\ = 


That is lirnm-s-oo fm-i{x) = fp{x) pointwise and also uniformly by Dini’s Theorem. Therefore 
we can approximate (by hxing m ^ 0) the density function as a simple function, made up of 
linear combination of indicator functions weighted by coefficients that represents the density 
value of the a-level sets of the density at hand. □ 


m—1 


(5.1) 


loi - Mx)\ < — < £. 

m 
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of Corollary 1. By Theoremj^we can approximate (by fixing m 3> 0) the density fnnction 


as: 


m—1 


of Proposition 1. If lim da,i 3 {^, 

m^oo ’ 


/p(^) ~ 

i=i 

where aj = (0j,/p) and (j)j is the indicator fnnction of the Oj-level set of P. Then if 
/p) -1 {(f) j, /q), for all the indicator fnnctions (f)j, then /p = /q. □ 

= 0, then 
/l,(P) /l,(Q) 'ii 

. Thns /m(x) = = /m(^) ^^^1 by Theorem 1 /p = fq. In the other way 

aronnd if P = Q (/p = /q) then it is certain that lim (iQ.,^(P, Q) = 0. □ 


B) Invariance under affine transformations 

Lemma 1. Lebesgue measure is equivalent under affine transformations. 

Proof. Let X be a random variable that take valnes in distribnted according to P, and 


let /p be its density fnnction. Let T : 


-)■ 


be an affine transformation, define the r.v. 


X* = T{X) = a + bRX, where a G b G and R G is an orthogonal matrix with 
det(i?) = 1 (therefore R~^ exist and R~^ = Rf^). Then X* is distribnted according to /p*. 
Define E* = {x*\x* = T{x) and x E E}, then: 


y*{E*) = 


dF* 


fp*{x*)dx* 




(5.2) 


dT-\x*) 


dx* 


dx* 




/ My)dy= / dF = y{E). 
'e Je 


□ 


Theorem 2. (Invariance under affine transformation) The metric proposed in Eg. 


(3.1) is invariant under affine transformations. 
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Proof. Let T be an affine transformation, we prove that the measnre of the symmetric differ¬ 
ence of any two a-level sets is invariant nnder affine transformation, that is: p(Aj(P) A74j(Q)) = 
fi* (T(A(P))AT(A,(Q))) = /r*(A(P*)AA,(Q*)). By Lemma 1: 

p*(Ai(P*)AA(Q*)) = f dF*+ [ dQ* 

J Ai(P*)-Ai('Q*) J Ai{Q*)-A,{¥*) 


dF- 


JAi(P)-Ai{Q) 

= fi{Ai{F)AAi 


)-Aim 


The same argnment can be applied to the denominator in the expression given in Eq. (3.1), 
thns 


Wi 


Wi 


lj.*{Ai{F*) U A(Q*)) /i(Ai(P) U Ai 
for i = 1,... ,m — 1. Therefore as this is trne for all the a-level sets, then the distance 


proposed in Eq. (3.1) is invariant nnder affine transformations: 

m—1 

<p(P*,Q*) = 


2=1 

m—1 

= y] Aid((/>i(F),(/)^ 

2=1 

= dQ„/3(P,Q). 


□ 


C) Details about the estimation of level sets 

Definition 4 (Neighbonrhood Measnres). Consider a random variable X with density func¬ 
tion f{x) defined on IR'^. Let Sn denote the set of random independent identically distributed 
samples of size n (drawn from f). The elements of Sn take the form Sn = {xi,--- ,Xn), 
where Xi G IR'^. Let M : IR'^ X Sn —> ^ be a real-valued function defined for all n G IN. (a) 
U f{^) < f{y) implies lim P{M{x, Sn) > M{y,Sn)) = 1, then M is a sparsity measure. 

n—^oo 

(b) If f{x) < f{y) implies lim P{M{x,Sn) < M{y,Sn)) = 1, then M is a concentration 

n^oo 

measure. 

Example 1. Consider the distance from a point x G IR'^ to its -nearest neighbour in Sn, 
xiC: M{x,Sn) = dk{x,Sn) = d{x,x^’‘'>): it is a sparsity measure. 

Theorem 3. The set = {x : hn{x) = sign{p*^ — gn{x)) > 0} converges to a region of the 
form Saif) = {^\fi^) ^ Q:}? such that P(S'q,(/)) = 1 — z/. Therefore, the Support Neighbour 
Machine estimates a density contour cluster Saif) (around the mode). 
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Theorem 1^ (see details in (Munoz and Moguerza, 2006, 2005)) ensures the convergence 
of the empirical estimation of the proposed distance. When the sample size increases, we are 
able to determine with more precision the sets Ai(P) and Aj(Q) and therefore dQ,,^(P, Q) A- 


d. 




D) Extensions of the weighting scheme 

We present in this appendix alternative weighting schemes to LS(1): 


1. Radius of the set Aj(P) A74i(Q): Choose Wi in Eq. (3.1) by: 


Wi = — max 


-^%{P)-'’Ai(Q)^^’ II ^ ^ II2 


2. Hausdorff distance between the sets 


A(P)-/l,(Q) 


and 




A.(P) : 


Choose Wi in Eq. (3.1) by: 


f I i 


where H{X,Y) denotes the Hausdorff distance (hnite size version) between hnite sets 
X and Y (which estimates the ‘theoretical’ Hausdorff distance between space regions 
X and Y). In this case X = Hj(P) — Hi(Q) and Y = Hj(Q) — Hj(P). 
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